barcodetrackR
Introduction
Motivation
barcodetrackR is an R package developed for the analysis and visualization of clonal tracking data from cellular barcoding experiments.
Installation
Currently, barcodetrackR is available at Github and can be downloaded using the devtools package.
Contributors
The R package and functions were created by Diego A. Espinoza, Ryland Mortlock, Samson J. Koelle, and others at Cynthia Dunbar’s laboratory at the National Heart, Lung, and Blood Institutes of Health. Issues should be addressed to https://github.com/d93espinoza/barcodetrackR/issues.
Loading data
Loading required packages
The barcodetrackR package operates on SummarizedExperiment objects from the Bioconductor repository. It stores associated colData for each sample in this object as well as any metadata. We load the barcodetrackR and SummarizedExperiment packages here for our analyses, as well as the magrittr package in order to improve legibility of code through using the pipe %>% operator.
Creating objects with create_SE
For this vignette, we will load publically available data from the following papers (these sample datasets are included in the R package):
- Wu, Chuanfeng, et al. “Clonal expansion and compartmentalized maintenance of rhesus macaque NK cell subsets.” Science Immunology (2018)
- Belderbos, Mirjam E., et al. “Donor-to-Donor Heterogeneity in the Clonal Dynamics of Transplanted Human Cord Blood Stem Cells in Murine Xenografts.” Biology of Blood and Marrow Transplantation (2020)
- Six, Emmanuelle, et al. “Clonal tracking in gene therapy patients reveals a diversity of human hematopoietic differentiation programs.” Blood (2020)
system.file("sample_data/WuC_etal/monkey_ZJ31.txt", package = "barcodetrackR") %>%
read.delim(row.names = 1) -> wu_dataframe
system.file("sample_data/WuC_etal/monkey_ZJ31_metadata.txt", package = "barcodetrackR") %>%
read.delim() -> wu_metadata
wu_SE <- create_SE(your_data = wu_dataframe,
meta_data = wu_metadata,
threshold = 0)
system.file("sample_data/BelderbosME_etal/mouse_UBC_C22.txt", package = "barcodetrackR") %>%
read.delim(row.names = 1) -> belderbos_dataframe
system.file("sample_data/BelderbosME_etal/mouse_UBC_C22_metadata.txt", package = "barcodetrackR") %>%
read.delim() -> belderbos_metadata
belderbos_SE <- create_SE(your_data = belderbos_dataframe,
meta_data = belderbos_metadata,
threshold = 0)
system.file("sample_data/SixE_etal/WAS5_reads.txt", package = "barcodetrackR") %>%
read.delim(row.names = 1) -> six_dataframe
system.file("sample_data/SixE_etal/WAS5_metadata.txt", package = "barcodetrackR") %>%
read.delim() -> six_metadata
six_SE <- create_SE(your_data = six_dataframe,
meta_data = six_metadata,
threshold = 0)In addition to reads data, the Six et al. paper includes “estimated abundance” data for each of these timepoints. We load these into a new SE in order to compare our analyses downstream.
system.file("sample_data/SixE_etal/WAS5_estabundance.txt", package = "barcodetrackR") %>%
read.delim( row.names = 1) -> six_dataframe_2
six_ea_SE <- create_SE(your_data = six_dataframe_2,
meta_data = six_metadata,
threshold = 0)Our input dataframes to create the SummarizedExperiment (SE) objects are each an n x m data.frame where there are n rows of observations (typically cellular barcodes, insertion sites, or the like) and the m columns are the samples. The input metadata must have row order identical to the order of the colums in its corresponding dataframe. The metadata must also have a column titled SAMPLENAME that denotes the column of your_data it refers to.
Assays created by create_SE
create_SE takes the input dataframe and metadata and creates a SummarizedExperiment object with the following assays:
counts: the raw values from the input dataframepercentages: the per-column proportions of each entry in each columnranks: the rank of each entry in each columnnormalized: the normalized read values (CPM)logs: the log of the normalized values
## List of length 5
## names(5): counts percentages ranks normalized logs
Correlations
scatter_plot
A straightforward way to view the relationship between samples in a pairwise manner is to view basic scatter plots of two samples using the provided assays. We provide a scatter_plot function as part of the package.
Here, we view the correlation of barcode counts between different cell types at the 20 month timepoint of the Wu et al study. We compare granulocytes (Gr) to B and T cells.
Gr_B_20 <- c("ZJ31_20m_Gr", "ZJ31_20m_B")
Gr_T_20 <- c("ZJ31_20m_Gr", "ZJ31_20m_T")
wu_scatterplot_1 <- scatter_plot(wu_SE[,Gr_B_20], your_title = "Gr vs B")
wu_scatterplot_2 <- scatter_plot(wu_SE[,Gr_T_20], your_title = "Gr vs T")
cowplot::plot_grid(wu_scatterplot_1, wu_scatterplot_2, ncol = 2)cor_plot
A more comprehensive way to view the relationship between samples in a pairwise manner is to use a correlation plot. Here, we view the Pearson correlation between the T, B, Gr, NK CD56+/CD16-, and NK CD56-/CD16+ fractions within the Wu dataset for the 6, 9.5, 12, and 20 month post-transplant timepoints.
wu_cor_plot_sample_selection <- colData(wu_SE)$SAMPLENAME[1:20]
cor_plot(wu_SE[,wu_cor_plot_sample_selection],
method_corr = "pearson",
plot_type = "color")We can also use the spearman correlation as our metric of interest, and visualize our correlation plot using circles with areas proportional to the absolute value of the correlation and colors corresponding to the value of the correlations.
We can return a table of the Pearson correlations as well as the p-values and confidence intervals for each of the comparisons above.
cor_plot(wu_SE[,wu_cor_plot_sample_selection],
method_corr = "pearson",
return_table = TRUE) %>% head## sample_i sample_j correlation_value p_value ci_lo ci_hi
## 1 ZJ31_6m_T ZJ31_6m_T 1.0000000 0.000000e+00 1.0000000 1.0000000
## 2 ZJ31_6m_T ZJ31_9.5m_T 0.4672994 0.000000e+00 0.4527245 0.4816246
## 3 ZJ31_6m_T ZJ31_12m_T 0.4406121 0.000000e+00 0.4261147 0.4548832
## 4 ZJ31_6m_T ZJ31_20m_T 0.3730234 0.000000e+00 0.3564373 0.3893744
## 5 ZJ31_6m_T ZJ31_6m_B 0.2263901 7.739959e-201 0.2122372 0.2404479
## 6 ZJ31_6m_T ZJ31_9.5m_B 0.2346975 2.001192e-191 0.2197056 0.2495785
Above, we used two of the three available correlation visualizations ("color" and "circle") using the standard color palette provided. A no_negative parameter is offered as well to eliminate negative correlations within the data, from which deriving biological meaning may be difficult.
When there are a smaller number of samples to analyze, the "number" option can be used to view the actual correlation within the grid. Here is an example comparing the pearson correlation of clones within peripheral blood at subsequent timepoints from the belberdos sample data set.
belderbos_wk_samples_PB <- c("wk10", "wk14", "wk20", "wk27", "wk33")
cor_plot(your_SE = belderbos_SE[,belderbos_wk_samples_PB],
method_corr = "pearson",
plot_type = "number",
number_size = 3)Clonal patterns
barcode_ggheatmap
A useful visualization to study clonal patterns over time is by using a heat map which clusters the top clones based on relatedness and displays their proportion in each sample over time. Our function barcode_ggheatmap does this by choosing the top N clones (n_clones) within each sample and tracking them over time. The argument n_clones assumes that in most cases, the large-contributing clones are of most interest to the user. This assumption can be relaxed by passing a large value to the argument.
We first visualize the top 10 clones from the selected samples in the Wu dataset. The default cell note is stars for the top 10 clones in each sample.
barcode_ggheatmap(your_SE = wu_SE[,wu_cor_plot_sample_selection],
n_clones = 10,
label_size = 14,
cellnote_size = 4)We can also visualize the percentage contribution of each of these clones and add a dendogram which clusters clones based on the Euclidean distance between each clone’s log assay. Here we plot the top 5 clones within the Six dataset. First, we order the columns to group them by celltype.
six_celltype_order <- c("m13_TCELLS", "m36_TCELLS", "m43_TCELLS", "m55_TCELLS",
"m13_BCELLS", "m36_BCELLS", "m43_BCELLS", "m55_BCELLS",
"m13_NKCELLS", "m36_NKCELLS", "m43_NKCELLS","m55_NKCELLS",
"m13_GRANULOCYTES", "m36_GRANULOCYTES", "m43_GRANULOCYTES",
"m55_GRANULOCYTES",
"m13_MONOCYTES", "m36_MONOCYTES", "m43_MONOCYTES","m55_MONOCYTES")
barcode_ggheatmap(your_SE = six_SE[,six_celltype_order],
n_clones = 5,
cellnote_assay = "percentages",
cellnote_size = 2,
label_size = 14,
dendro = TRUE,
clusters = 4,
distance_method = "euclidean")Splitting the clones into 4 clusters makes it easy to identify clones which are biased towards Monocytes (purple), T cells (green), NK cells (light blue), and B cell / Granulocyte (red).
The Belderbros dataset generally contains less clones than either the Wu or Six datasets, likely because the data is from mice. When possible, using the “percentages” option of the cellnote_assay aqrgument provides maximal information.
belderbos_wk_samples_PB <- c("wk10", "wk14", "wk20", "wk27", "wk33")
barcode_ggheatmap(your_SE = belderbos_SE[,belderbos_wk_samples_PB],
n_clones = 10,
label_size = 18,
dendro = TRUE,
cellnote_size = 4,
cellnote_assay = "percentages",
clusters = 4)barcode_binary_heatmap
In some cases, we may be interested in a global view of the presence or absence of barcodes across samples, regardless of read abundance. In that case, a binary heat map can be generated using barcode_binary_heatmap to give the simplest visual representation. Here we view the binary heat map of the belderbos data with a threshold of 0.01, meaning clones that make up less than 1% of a sample are treated as not detected.
barcode_binary_heatmap(your_SE = belderbos_SE[,belderbos_wk_samples_PB],
label_size = 12,
threshold = 0.01)clonal_contribution
Another familiar way to visualize clonal patterns over time is using a line or bar chart showing the proportion of top clones. In the above heat maps for wu data, we could see that there are some large uni-lineage CD56-/CD16+ NK cell clones. We can view the expansion of the top clones from the final timepoint through a stacked area line chart showing the proportion of each clone in CD56-/CD16+ NK cell samples across time.
clonal_contribution(your_SE = wu_SE[,17:20],
SAMPLENAME_choice = "ZJ31_20m_NK_CD56n_CD16p",
n_clones = 10,
graph_type = "line",
plot_over = "months")The colored clones represent the top 10 clones in the 20 month CD56-/CD16+ NK sample. The gray clones are other smaller clones which are found in the CD16+ samples at any of the four timepoints.
For data with fewer clones, a bar chart might be appropriate. We can view the peripheral blood samples from the belderbos dataset, previously shown as a heat map. By turning the plot_non_selected argument to false, we can only look at the top 10 clones from the chosen sample and ignore. We can also use categorical spacing on the x-axis rather than numeric by setting keep_numeric to false.
clonal_contribution(your_SE = belderbos_SE[,belderbos_wk_samples_PB],
SAMPLENAME_choice = "wk33",
n_clones = 10,
graph_type = "bar",
plot_over = "weeks",
keep_numeric = FALSE,
plot_non_selected = FALSE)Clonal bias
bias_histogram
The most straightforward way to view the bias between samples is using a histogram. We include the bias_histogram function which allows users to compare two samples from a given piece of metadata "split_bias_on" faceted by another piece of metadata "split_bias_over", which is illustrated below using the Wu et al data comparing clonal bias between B and T cells from the 6, 9.5, 12, and 20 month post-transplant timepoints.
wu_bias_plot_sample_selection <- colData(wu_SE)$SAMPLENAME[1:20]
bias_histogram(your_SE = wu_SE[,wu_bias_plot_sample_selection],
split_bias_on = "celltype",
bias_1 = "B",
bias_2 = "T",
split_bias_over = "months",
ncols = 2)The stacked bars of the histogram represent individual clones. For the Wu dataset, there are a multitude of small clones so the stacked bars are not visible. We can view clonal bias between B and T cells from the Six, et al dataset which has a smaller number of larger clones.
bias_histogram(your_SE = six_SE,
split_bias_on = "celltype",
bias_1 = "B",
bias_2 = "T",
split_bias_over = "months",
ncols = 2)ridge_plot
An alternative to the histogram is a ridge plot which shows clonal bias between cell types through a density estimation of the number of clones at each value of the log bias. Since the ridge plot treats clonal bias as a continuous variable, it can reveal trends that are masked by grouping into bins with a histogram.
It is important to note that in order to handle clones which have a count of zero in one of the samples, log+1 normalization is used within the ridge plot function. This differs from the histogram where these clones can be grouped into the farthest bins on either side. The log bias formula for the ridge plot function is given by:
\(logbias=log(\frac{normalized_1+1}{normalized_2+1})\)
Here, we view a ridge plot showing clonal bias between B cells and Granulocytes from the Wu dataset. We calculate the density statistic using the cumulative sum of the normalized values for each log2 comparison. We also visualize each clone as a dot on the plot, proportionate to the cumulative sum of the normalized values.
ridge_plot(your_SE = wu_SE,
split_bias_on = "celltype",
bias_1 = "B",
bias_2 = "Gr",
split_bias_over = "months",
weighted = TRUE,
add_dots = TRUE)We again view a comparison in the Wu dataset, this time comparing NK CD56- CD16+ cells to Granulocytes.
ridge_plot(your_SE = wu_SE,
split_bias_on = "celltype",
bias_1 = "NK_CD56n_CD16p",
bias_2 = "Gr",
split_bias_over = "months",
weighted = TRUE,
add_dots = TRUE)bias_distribution_barplot
bias_distribution_barplot(your_SE = wu_SE[,wu_bias_plot_sample_selection],
split_bias_on = "celltype",
bias_1 = "B",
split_bias_over = "months")Right now, error in line 52 that I can’t figure out.
ternary_plot
A ternary plot can be made to visualize shared clonality between three samples. The default for the function is to plot the density of clones. To instead, see the actual clones as dots, the "density_mode" argument can be set to FALSE.
wu_ternary_plot_sample_selection <- c("ZJ31_6m_Gr", "ZJ31_6m_B", "ZJ31_6m_T")
ternary_plot(your_SE = wu_SE[,wu_ternary_plot_sample_selection])A ternary plot for the first three samples from the Six dataset.
This function needs a lot of work, and I will re-visit. But for now, I just made it compatible with the SummarizedExperiment. I think the ternary plot is still worth including and that the density mode should be the default mode.
Diversity
rank_abundance
A way to depict clone richness and evenness within a plot is by using a rank-abundance plot. Here, the cumulative abundance of every clone in a sample is plotted in descending rank from 1 to n (where there are n clones in the sample being plotted), or by scaling all ranks to the range [0,1].
We plot all samples from the Belderbos dataset. Note that the week 10 and 14 samples appear to have the most evenness across detected clones, while the other samples contain both large and small detected clones.
clonal_diversity
Within-sample diversity indices (also referred to as alpha diversities) are indices computed independently for each sample in a data set. With clonal tracking, these diversity indices can give a global indication about the number of species in a sample using the number of detected species as input and sometimes also leveraging the proportional abundances of species within the sample. We include the function clonal_diversity which can calculate three diversity indices (making use the vegan package):
"shannon"\(H'=-\sum _{i=1}^{R}p_{i}\ln p_{i}\)"simpson"\(\lambda =\sum _{i=1}^{R}p_{i}^{2}\)"invsimpson"\({\displaystyle {\frac {1}{\lambda }}={1 \over \sum _{i=1}^{R}p_{i}^{2}}={}^{2}D}\)
We also include "count" as an option for index_type in order to use the total detected clones per sample as a measure for diversity.
As an example, we can plot the shannon diversities for the 5 cell types within the Wu dataset over time.
Here we show the simpson indices for the unsorted samples in the Belderbos dataset over time; the last time point splits the cell fraction in to T, B, and Granulocyte fractions, allowing comparison of their simpson indices.
Similar to the Wu dataset, the Six dataset contains clonal tracking information for the T, B, Gr, Monocyte, and NK lineages. We can plot these over time as well and interstingly observe that the shannon diversity of the NK cell fraction follows a similar pattern to that of the Wu dataset.
mds_plot
Measures of simmilarity or dissimilarity between samples are known as beta-diversity indices (or distances if they are metrics). A common way for depicting these beta-diversity indices are using what are known as PCoA (Principal Coordinate Analysis) plots, in which an input distance matrix is plotted in two dimensions. Again, we leverage the "vegan" package here to call vegandist which allows us to calculate a number of dissimilarity indices between our samples (choosing an assay from the SummarizedExperiment object) and then perform principal coordinates analysis using cmdscale. Note that using "euclidean" as our index is equivalent to performing PCA (Prinicipal Components Analysis) on our data.
One of the most commonly used beta-diversity indices is the (Bray-Curtis Dissimilarity)[https://en.wikipedia.org/wiki/Bray%E2%80%93Curtis_dissimilarity]. Here, we find the Bray-Curtis dissimilarity index between all of the samples in the Wu dataset and use PCoA to plot them on two dimensions. From the plot, it is evident that NK cells are most dissimilar from all other celltypes when considering the Bray-Curtis index.
When using the Bray-Curtis dissimilarity index between all of the samples in the Six dataset, we find that similarly to the Wu dataset, NK cells appear dissimilar from the remainder of the celltypes, while Monocytes and Granulocytes appear most similar to one another.
Circos plot
The circos plot is a fun way to view relationships between variables. We make use of the circlize package to show shared clones between compartments as links between regions around a circle. Here, we will start by subetting our data to a single timepoint and comparing just three cell types: T cells, B cells, and Granulocytes.
wu_circos_selection <- c("ZJ31_12m_T","ZJ31_12m_B","ZJ31_12m_Gr")
circos_plot(wu_SE[,wu_circos_selection], plot_label = "celltype")Here, we can see that most clones are present in all three cell types (purple link). There are also clones shared between each pair-wise combination of cell types. Based on the width of the links, you can ascertain that more clones are shared between Granulocytes and B cells (yellow link), followed by B cells and T cells (blue link), then T cells and Granulocytes (green link). The portion of each cell type without a link (empty space) represents the number of clones which are unique to that cell type.
We can also create a weighted circos plot. The difference is that in the previous plot, the width of the links between cell types are proportional to the number of clones shared between those cell types. In the weighted heat map, the width of the links between two cell types is proportional to the proportion of contribution of the shared clones to the overall hematopoiesis in that cell type.
In this example, the circos plots look very similar. But you can see the subtle difference when you look at the blue link showing clones shared between B and T cells. The link is wider when connecting to T cells because the shared Bcell-Tcell clones represent a larger fraction of detected hematopoiesis in the T cell compartment than the B cell compartment.
The circos plot can handle any number of inputs but keep in mind that the number of unique combinations rises exponentially with the number of compartments. Here, we use the alpha parameter to control the transparency of the links. And we plot four cell types rather than three.
wu_circos_selection2 <- c("ZJ31_12m_T","ZJ31_12m_B","ZJ31_12m_Gr","ZJ31_12m_NK_CD56n_CD16p")
circos_plot(wu_SE[,wu_circos_selection2], plot_label = "celltype", alpha = 0.9)The plot is quite involved, but we can quickly draw a few high-level conclusions. A large chunk of clones are shared between all four cell types (darkest purple color). Another large chunk of clones is shared between B cells, T cells, and Grans but not NK cells (lighter purple). Note that each unique color signifies a unique combination of the four cell types.
The regions of the circos plot need not be cell types. They could also be timepoints. Here, we show that example using the belderbos dataset from mouse studies showing only the first four timepoints of peripheral blood samples.
belderbos_wk_samples_PB <- c("wk10", "wk14", "wk20", "wk27", "wk33")
circos_plot(belderbos_SE[,belderbos_wk_samples_PB])From the circos plot, we can see that the wk10 and wk14 timepoints have fewer number of clones detected, as indicated by their smaller bars around the perimeter. There are some clones shared between all timepoints (dark purple links) but most of the clones appear to be present in the wk20, wk27, and wk33 samples but not the wk10 and wk14 (green and yellow links).
When looking at the weighted heat map, one sees that the majority of hematopoiesis in timepoints past 10 weeks is accounted for by the group of clones (turquoise color) that are present at all timepoints except for the wk10 timepoint. There are some clones present at all timepoints (dark purple) and the green and yellow links mostly represent fraction of hematopoiesis in the later timepoints coming from clones first detected at wk14 or wk20.
From this example, one can see that the two plots paint different but complementary pictures. With the regular circos plot, clones are treated as equal so the information on fractional contribution is lost. However, the number of detected clones is indicated by the length of each region around the perimeter of the circos. This information can be useful for some studies and it is lost in the weighted heat map since all compartments poportion adds to 100%. We recommend using the regular and circos plot in combination to obtain maximal information from any given dataset.